1  Introduction

VSIPL++ [1] is the object-oriented “next-generation” version of the Vector Signal and Image
Processing Library (VSIPL) [2]. Like VSIPL, VSIPL++ specifies an Application Programming Interface
(API) for developing high-performance numerical applications, with a particular focus on embedded
real-time systems that perform signal and image processing. VSIPL++ contains a number of
improvements relative to VSIPL, including a simpler, more intuitive programming model, simpler
syntax, and greater flexibility and extensibility. The most significant of these improvements is
support for multi-processor computation. This parallel support requires only that the user specify
how data should be distributed across processors. The VSIPL++ library automatically manages the
transmission of data between processors as necessary to perform the desired computations
efficiently.

CodeSourcery was awarded funding under the Air Force Small Business Innovation Research (SBIR)
program to develop a prototype version of the parallel functionality described in the VSIPL++
specification and to measure VSIPL++ performance on parallel systems [3]. Our prototype
implementation achieves a near-linear speedup on multi-processor systems, demonstrating that,
despite the high level of abstraction present in VSIPL++, it is nevertheless possible to obtain
excellent performance. Thus, VSIPL++ has the potential to allow programmers to easily and rapidly
develop systems that are both highly portable and highly efficient. In our presentation, we will
describe the parallel VSIPL++ programming model, our parallel performance benchmark, and the
results we obtained.


2  Benchmark  Description 

Beamforming is the detection of energy propagating in a particular direction while rejecting energy
propagating in other directions. A beamformer consists of an array of sensors that capture signals
and a signal processing algorithm that extracts signals from one or more particular directions and
one or more particular frequencies. The k-Ω beamformer we consider assumes uniform spacing of
omnidirectional individual sensors along the x-axis. No assumptions are made about the signal’s
structure except that the signal is periodic and that the signal source is sufficiently far away
that the signal appears planar to the sensors. Noise is assumed to be uniformly distributed across
the signal.

The beamformer computes the power of the incoming signal for various bearings (k) and frequencies
(Ω). Those k-Ω pairs where the power is strongest indicate incoming signals. Each sensor samples
the input signal over time. After enough samples have been obtained, a computation involving the
inputs from all of the sensors determines the power for the k-Ω pairs. First, FIR filters remove
higher-order frequencies from the signal matrix. Then a real-to-complex FFT is applied to the rows
of the matrix, the data is optionally reordered into a column-major matrix, and finally a
complex-to-complex FFT is applied to the columns. Generally, the collection of data and the
determination of power are repeated multiple times. The final power reported for a given k-Ω pair
is the average of the values computed over the iterations of the process.
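
The averaging and peak selection described above can be sketched in plain C++. This is a minimal illustration under stated assumptions, not the benchmark code: the names `PowerMap`, `accumulate_average`, and `strongest_pair` are hypothetical, and the per-iteration power values are assumed to have been computed already.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Power for each (bearing, frequency) pair, stored as rows x columns.
// Hypothetical type for illustration; not part of VSIPL or VSIPL++.
using PowerMap = std::vector<std::vector<double>>;

// Add one iteration's power, scaled by 1/iterations, into the running average.
void accumulate_average(PowerMap& average, const PowerMap& power, std::size_t iterations) {
    assert(iterations > 0 && average.size() == power.size());
    for (std::size_t r = 0; r < power.size(); ++r) {
        assert(average[r].size() == power[r].size());
        for (std::size_t c = 0; c < power[r].size(); ++c)
            average[r][c] += power[r][c] / static_cast<double>(iterations);
    }
}

// Return the (row, column) index holding the strongest averaged power.
std::pair<std::size_t, std::size_t> strongest_pair(const PowerMap& average) {
    assert(!average.empty() && !average[0].empty());
    std::pair<std::size_t, std::size_t> best{0, 0};
    for (std::size_t r = 0; r < average.size(); ++r)
        for (std::size_t c = 0; c < average[r].size(); ++c)
            if (average[r][c] > average[best.first][best.second])
                best = {r, c};
    return best;
}
```

Calling `accumulate_average` once per iteration with the same `iterations` count yields the mean power, and `strongest_pair` then reports the index of the most likely incoming signal.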

[1] http://www.hpec-si.org/private/vsipl++specification.html

[2] http://www.vsipl.org

[3] SBIR Contract FA87450-04-C-0017
3  Implementation

We implemented the beamformer using several different programming methodologies. One
implementation was written using C and VSIPL. After that implementation was complete, we developed
a C++ and VSIPL++ implementation. The VSIPL and VSIPL++ implementations are similar in structure,
but the VSIPL++ implementation is shorter than the VSIPL implementation because of the higher
levels of abstraction provided by VSIPL++. Each implementation runs the k-Ω beamformer multiple
times and computes a “running average” of the power spectra.

We modified the VSIPL++ reference implementation, developed by CodeSourcery under contract from
MIT Lincoln Laboratory, to support a subset of the functionality being considered for the parallel
VSIPL++ specification. In particular, we created a data storage abstraction called
DistributedBlock to represent a single one- or two-dimensional array that is stored across
multiple processors. We specialized VSIPL++ algorithms, e.g., those computing FIR filters and
FFTs, to perform only local computations when operating on a DistributedBlock. We also modified
VSIPL++ to implement a specialization of the two-dimensional FFT algorithm so that, when the input
is a row-distributed DistributedBlock and the output is a column-distributed block, the algorithm
performs the required “corner-turn.”
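
To make the corner-turn concrete, the following sketch simulates it inside a single process, with no MPI involved. It is an illustration only and does not reproduce the DistributedBlock implementation: the `LocalBlock` alias and the `corner_turn` function are hypothetical, and the row and column counts are assumed to divide evenly by the number of simulated processors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One simulated processor's local storage (hypothetical, for illustration).
using LocalBlock = std::vector<std::vector<double>>;

// Simulated corner turn. Input: by_rows[p] holds rows [p*R/P, (p+1)*R/P) of an
// R x C matrix. Output: by_cols[q] holds columns [q*C/P, (q+1)*C/P), each
// stored contiguously with all R elements.
std::vector<LocalBlock> corner_turn(const std::vector<LocalBlock>& by_rows, std::size_t cols) {
    const std::size_t procs = by_rows.size();
    assert(procs > 0 && cols % procs == 0);
    const std::size_t rows_per_proc = by_rows[0].size();
    const std::size_t cols_per_proc = cols / procs;
    const std::size_t rows = rows_per_proc * procs;

    std::vector<LocalBlock> by_cols(
        procs, LocalBlock(cols_per_proc, std::vector<double>(rows)));
    // Each (sender p, receiver q) pair moves one sub-block, as in an all-to-all exchange.
    for (std::size_t p = 0; p < procs; ++p) {
        assert(by_rows[p].size() == rows_per_proc);
        for (std::size_t q = 0; q < procs; ++q)
            for (std::size_t i = 0; i < rows_per_proc; ++i) {
                assert(by_rows[p][i].size() == cols);
                for (std::size_t j = 0; j < cols_per_proc; ++j)
                    by_cols[q][j][p * rows_per_proc + i] = by_rows[p][i][q * cols_per_proc + j];
            }
    }
    return by_cols;
}
```

In a real distributed run the inner copy would be replaced by message passing between processors; the index arithmetic is the part this sketch is meant to show.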

The VSIPL++ specification is written so as to be independent of any particular message-passing or
threading system. However, in our implementation we chose to use the popular Message Passing
Interface (MPI) [4] to transmit data between cooperating processors.

After making these modifications to the VSIPL++ implementation, we made minor changes to our
VSIPL++ benchmark program. These changes consisted only of modifications to the types used to
declare particular arrays in the benchmark program. For example, some arrays were changed to use
DistributedBlock to indicate distribution across processors. The types of these arrays indicate
whether the arrays are distributed by rows or by columns.


4  Results 

The following table demonstrates that we were able to obtain a near-linear speedup with parallel
VSIPL++ relative to serial VSIPL++ and VSIPL. Times are shown for the VSIPL implementation of the
benchmark, the serial VSIPL++ implementation, and the parallel VSIPL++ implementation with one and
two processors. Times for two hundred iterations of one problem instance are presented. Times for
other instances are similar and will be presented in the extended version of this paper. In all
cases, the times were obtained by running on a dual-processor Intel Pentium 4 Xeon GNU/Linux
machine. The times given reflect only time spent executing the beamforming computations. They do
not include time required for initialization and finalization of the application and its
libraries. All times shown are “wall-clock time,” i.e., the total number of seconds required to
execute the beamformer, including time spent in the operating system kernel.


                 serial, no corner turn        distributed VSIPL++
Time (s)         VSIPL         VSIPL++         1-processor    2-processor

FIR filter       20.7          20.7 + 0.1      20.7 + 0.3     20.7/2 + 0.2
1st FFT          12.7          12.7 + 0.1      12.7 + 0.0     12.7/2 + 0.3
Corner Turn      -             -               6.2 + 2.9      6.0 + 8.9
2nd FFT          10.0 + 9.2    10.0 + 9.1      10.0           10.0/2 + 0.1

The corner-turn times indicate the seconds required to transpose a row-major matrix to a
column-major matrix plus the time for an MPI all-to-all communication. With one processor, only
the transpose occurs. With two processors, MPI communication also occurs. The serial
implementations do not perform the transposition, which increases the time for the second FFT.
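
As a worked check of the table, the sketch below adds up the distributed VSIPL++ entries and forms the ratio of one-processor to two-processor time. It assumes that each entry written "a + b" denotes a sum of seconds; the function and variable names are hypothetical and serve only this illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sum the per-stage times (seconds) for one configuration.
double total_seconds(const std::vector<double>& stage_seconds) {
    return std::accumulate(stage_seconds.begin(), stage_seconds.end(), 0.0);
}

// Ratio of baseline time to parallel time.
double speedup(double baseline_seconds, double parallel_seconds) {
    assert(parallel_seconds > 0.0);
    return baseline_seconds / parallel_seconds;
}

// Table entries for distributed VSIPL++, assuming "a + b" means a sum in seconds.
// Order: FIR filter, 1st FFT, 2nd FFT (the corner turn is kept separate).
const std::vector<double> one_proc_compute{20.7 + 0.3, 12.7 + 0.0, 10.0};
const std::vector<double> two_proc_compute{20.7 / 2 + 0.2, 12.7 / 2 + 0.3, 10.0 / 2 + 0.1};
const double one_proc_corner_turn = 6.2 + 2.9;
const double two_proc_corner_turn = 6.0 + 8.9;
```

Under this reading, the three computation stages alone give a ratio of about 1.96 (43.7 s against 22.3 s), and adding the corner-turn entries gives about 1.42 (52.8 s against 37.2 s).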


[4] http://www-unix.mcs.anl.gov/mpi/
Challenge


o  "Object  oriented  technology  reduces 
software  cost." 

o  "Fully  utilizing  HPEC  systems  for  SIP 
applications  requires  managing  operations 
at  the  lowest  possible  level." 

o  "There  is  great  concern  that  these  two 
approaches  may  be  fundamentally  at 
odds." 
Parallel Performance Vision

“Drastically reduce the performance penalties associated with deploying object-oriented software
on high performance parallel embedded systems.”

“Automated to reduce implementation cost.”
Advantages of VSIPL

o Portability
    Code can be reused on any system for which a VSIPL implementation is available.

o Performance
    Vendor-optimized implementations perform better than most handwritten code.

o Productivity
    Reduces SLOC count.
    Code is easier to read.
    Skills learned on one project are applicable to others.
    Eliminates use of assembly code.
Limitations of VSIPL

o Uses C Programming Language
    "Modern object oriented languages (e.g., C++) have consistently reduced the development time
    of software projects."
    Manual memory management.
    Cumbersome syntax.

o Inflexible
    Abstractions prevent users from adding new high-performance functionality.
    No provisions for loop fusion.
    No way to avoid unnecessary block copies.

o Not Scalable
    No support for MPI or threads.
    SIMD support must be entirely coded by vendor; user cannot take advantage of SIMD directly.
Parallelism: Current Practice

MPI used for communication, but:

o MPI code often a significant fraction of total program code.
o MPI code notoriously hard to debug.
o Tendency to hard-code number of processors, data sizes, etc.
o Reduces portability!

Conclusion: users should specify only data layout.
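
One way to avoid hard-coding the number of processors is to derive each processor's share of the data from the run-time processor count. The helper below is a minimal sketch of that idea; `owned_rows` is a hypothetical name and is not part of MPI, VSIPL, or VSIPL++.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open row range [first, last) owned by processor `rank` when `rows` rows
// are split as evenly as possible over `procs` processors.
std::pair<std::size_t, std::size_t> owned_rows(std::size_t rows, std::size_t procs,
                                               std::size_t rank) {
    assert(procs > 0 && rank < procs);
    const std::size_t base = rows / procs;
    const std::size_t extra = rows % procs;  // the first `extra` ranks get one more row
    const std::size_t first = rank * base + (rank < extra ? rank : extra);
    const std::size_t count = base + (rank < extra ? 1 : 0);
    return {first, first + count};
}
```

Because the ranges depend only on `rows`, `procs`, and `rank`, the same code runs unchanged on any processor count, which is the portability property the slide argues for.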
Atop  VSIPL's  Foundation


VSIPL  VSIPL++ 
Leverage VSIPL Model

o Same terminology:
    o Blocks store data.
    o Views provide access to data.
    o Etc.

o Same basic functionality:
    o Element-wise operations.
    o Signal processing.
    o Linear algebra.
VSIPL++ Status

o Serial Specification: Version 1.0a
    Support for all functionality of VSIPL.
    Flexible block abstraction permits varying data storage formats.
    Specification permits loop fusion, efficient use of storage.
    Automated memory management.

o Reference Implementation: Version 0.95
    Support for functionality in the specification.
    Used in several demo programs; see next talks.
    Built atop VSIPL reference implementation for maximum portability.

o Parallel Specification: Version 0.5
    High-level design complete.
k-Ω Beamformer

Input:

o Noisy signal arriving at a row of uniformly distributed sensors.

Output:

o Bearing and frequency of signal sources.
SIP Primitives Used

o Computation:
    o FIR filters
    o Element-wise operations (e.g., magsq)
    o FFTs
    o Minimum/average values

o Communication:
    o Corner-turn
        All-to-all communication
    o Minimum/average values
        Gather
Computation

1. Filter signal to remove high-frequency noise. (fir)

2. Remove side-lobes resulting from discretization of data. (mult)

3. Apply Fourier transform in time domain. (fft)

4. Apply Fourier transform in space domain. (fft)

5. Compute power spectra. (mult, magsq)
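
The steps above can be sketched for a single sensor channel in standard C++. This is a simplified stand-in, not the VSIPL or VSIPL++ kernel: it is one-dimensional, it uses a naive O(N^2) discrete Fourier transform in place of the library FFTs, and the function names merely echo the primitive names on the slide.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Step 1: causal FIR filter, y[n] = sum_k h[k] * x[n-k], treating x[n] = 0 for n < 0.
std::vector<double> fir(const std::vector<double>& h, const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size() && k <= n; ++k)
            y[n] += h[k] * x[n - k];
    return y;
}

// Step 2: element-wise multiply by a weight (window) vector.
std::vector<double> mult(const std::vector<double>& w, const std::vector<double>& x) {
    assert(w.size() == x.size());
    std::vector<double> y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n) y[n] = w[n] * x[n];
    return y;
}

// Steps 3-4 stand-in: naive discrete Fourier transform of real input.
std::vector<std::complex<double>> dft(const std::vector<double>& x) {
    const double pi = std::acos(-1.0);
    const std::size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (std::size_t f = 0; f < N; ++f)
        for (std::size_t n = 0; n < N; ++n)
            X[f] += x[n] * std::polar(1.0, -2.0 * pi * double(f * n) / double(N));
    return X;
}

// Step 5: squared magnitude of each spectral value.
std::vector<double> magsq(const std::vector<std::complex<double>>& X) {
    std::vector<double> p(X.size());
    for (std::size_t f = 0; f < X.size(); ++f) p[f] = std::norm(X[f]);
    return p;
}
```

A full pass for one channel is then `magsq(dft(mult(weights, fir(coefficients, samples))))`, which mirrors the nesting of the kernel expression.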
Diagram  of  the  Kernel
VSIPL Kernel

Seven statements required:

    for (i = n; i > 0; --i) {
        filtered = filter(firs, signal);
        vsip_mmul_f(weights, filtered, filtered);
        vsip_rcfftmop_f(space_fft, filtered, fft_output);
        vsip_ccfftmip_f(time_fft, fft_output);
        vsip_mcmagsq_f(fft_output, power);
        vsip_ssmul_f(1.0 / n, power);
        vsip_madd_f(power, spectra, spectra);
    }
VSIPL++ Kernel

One statement required:

    for (i = n; i > 0; --i)
        spectra += 1.0/n * magsq(time_fft(space_fft(weights * filter(firs, signal))));

No changes are required for distributed operation.
Distribution in User Code

Serial case:

    Matrix<float_t, Dense<2, float_t> > signal_matrix;

Parallel case:

    typedef Dense<2, float_t> subblock;
    typedef Distributed<2, float_t, subblock, ROW> Block2R_t;
    Matrix<float_t, Block2R_t> signal_matrix;

User writes no MPI code.
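
The idea that only a type declaration changes can be illustrated with a small, self-contained C++ sketch. The classes below are hypothetical stand-ins written for this example and are NOT the VSIPL++ `Dense`, `Distributed`, or `Matrix` types; the "distributed" block only simulates two processors inside one process.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense storage: one contiguous row-major array.
class DenseBlock {
public:
    DenseBlock(std::size_t rows, std::size_t cols) : cols_(cols), data_(rows * cols, 0.0f) {}
    float get(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }
    void put(std::size_t r, std::size_t c, float v) { data_[r * cols_ + c] = v; }
private:
    std::size_t cols_;
    std::vector<float> data_;
};

// Hypothetical row distribution: contiguous row chunks, one DenseBlock per
// simulated processor. `procs` must be nonzero.
class RowDistributedBlock {
public:
    RowDistributedBlock(std::size_t rows, std::size_t cols, std::size_t procs = 2)
        : rows_per_proc_((rows + procs - 1) / procs) {
        for (std::size_t p = 0; p < procs; ++p) parts_.emplace_back(rows_per_proc_, cols);
    }
    float get(std::size_t r, std::size_t c) const {
        assert(r / rows_per_proc_ < parts_.size());
        return parts_[r / rows_per_proc_].get(r % rows_per_proc_, c);
    }
    void put(std::size_t r, std::size_t c, float v) {
        assert(r / rows_per_proc_ < parts_.size());
        parts_[r / rows_per_proc_].put(r % rows_per_proc_, c, v);
    }
private:
    std::size_t rows_per_proc_;
    std::vector<DenseBlock> parts_;
};

// A view that forwards element access to whichever block type it is given.
template <typename Block>
class MatrixView {
public:
    MatrixView(std::size_t rows, std::size_t cols) : rows_(rows), cols_(cols), block_(rows, cols) {}
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    float get(std::size_t r, std::size_t c) const { return block_.get(r, c); }
    void put(std::size_t r, std::size_t c, float v) { block_.put(r, c, v); }
private:
    std::size_t rows_, cols_;
    Block block_;
};

// Algorithm code is written once and never mentions the storage layout.
template <typename Block>
float sum_all(const MatrixView<Block>& m) {
    float total = 0.0f;
    for (std::size_t r = 0; r < m.rows(); ++r)
        for (std::size_t c = 0; c < m.cols(); ++c) total += m.get(r, c);
    return total;
}

// Only these declarations differ between the serial and the distributed case.
typedef MatrixView<DenseBlock> SerialMatrix;
typedef MatrixView<RowDistributedBlock> ParallelMatrix;
```

Switching a variable from `SerialMatrix` to `ParallelMatrix` leaves `sum_all` untouched, which is the design property the slide attributes to the block abstraction.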
VSIPL++ Implementation

o Added DistributedBlock:
    o Uses a "standard" VSIPL++ block on each processor.
    o Uses MPI routines for communication when performing block assignment.

o Added specializations:
    o FFT, FIR, etc. modified to handle DistributedBlock.
Performance Measurement

o Test system:
    o AFRL HPC system
    o 2.2GHz Pentium 4 cluster
o Measured only main loop
o No input/output
o Used Pentium Timestamp Counter
o MPI All-to-all not included in timings
    o Accounts for 10-25%

VSIPL++ Performance
Parallel  Speedup
Conclusions

o VSIPL++ imposes no overhead:
    o VSIPL++ performance nearly identical to VSIPL performance.

o VSIPL++ achieves near-linear parallel speedup:
    o No tuning of MPI, VSIPL++, or application code.

o Absolute performance limited by VSIPL implementation, MPI implementation, compiler.
VSIPL++

Visit the HPEC-SI website

http://www.hpec-si.org

o for VSIPL++ specifications
o for VSIPL++ reference implementation
o to participate in VSIPL++ development